Skip to content

optimize computation in librrgraph utils #2714

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 17, 2024
Merged

optimize computation in librrgraph utils #2714

merged 2 commits into from
Sep 17, 2024

Conversation

heshpdx
Copy link
Contributor

@heshpdx heshpdx commented Sep 5, 2024

Quick performance fix. In one hot loop, swap three expensive FDIVs for a single FDIV and three FMULs. Most CPUs execute FP multiplies magnitudes faster than FP divides.

No functional change seen, at least in my benchmarks.

In one hot loop, swap three expensive FDIVs for a single FDIV and three FMULs.
@github-actions github-actions bot added the lang-cpp C/C++ code label Sep 5, 2024
@AlexandreSinger
Copy link
Contributor

@vaughnbetz I think this is ready to merge. It LGTM.

@vaughnbetz
Copy link
Contributor

Thanks!

@vaughnbetz vaughnbetz merged commit f43e070 into verilog-to-routing:master Sep 17, 2024
53 checks passed
@vaughnbetz
Copy link
Contributor

I wonder if the compiler would have factored this out ? No harm in any case to merge this ...

@AlexandreSinger
Copy link
Contributor

AlexandreSinger commented Sep 17, 2024

@vaughnbetz I agree, I also think the compiler should handle this. The only thing that may not cause the compiler to do this optimization is precision. Division may yield a different result than multiplication by (1/x); therefore, if safe math compiler optimizations are turned on, the compiler cannot re-order the math operations. Since @heshpdx is working on the SPEC benchmark, it may be that safe math compiler optimizations are turned on to always give the same output regardless of optimization. That's my guess.

As far as I can tell, VTR does not turn on the fast-math optimization in GCC; so it should not reorder do anything in the floating point math which may yield different solutions. See: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-ffast-math
Screenshot from 2024-09-17 16-44-31

@vaughnbetz
Copy link
Contributor

Good point; that makes sense.

@heshpdx
Copy link
Contributor Author

heshpdx commented Sep 17, 2024

Thank you for merging!

If num_switches were declared const, then the compiler might have a chance. But since it is not, the divide instructions are output every time. People worry too much about FP precision; in general this rarely leads to significant control flow changes. People also say "teach the compiler the cost model for this", and that is happening but mostly guarded under -mcpu=native type flags, and if you don't have the latest compiler you are out of luck. So I recommend fixing the software because it can produce an impact immediately and backwards for older compilers.

For SPEC CPU, we do not build VPR with -Ofast because VPR uses std::isnan so everything would break. There may be some subset of fast math options that are safe, but we haven't explored those yet. The code in question also produces divides with -O3 which is my default for testing. We offer a small amount of tolerance in output which would accommodate such changes, if they actually did produce a different answer (in this case they didn't).

This type of code is surprisingly common in the field, especially inside of for-loops. Take a look at my recent github history and you can see me tearing through applications offering this same perf optimization in droves. 😄

@heshpdx
Copy link
Contributor Author

heshpdx commented Sep 18, 2024

We offer a small amount of tolerance in output which would accommodate such changes, if they actually did produce a different answer (in this case they didn't).

Actually, I should correct this statement. A month ago I stopped offering tolerance! All outputs for the SPEC CPU version of VPR now much match exactly regardless of compiler or arch, and indeed they do match through testing across a variety of systems. And this patch did not move the needle.

@vaughnbetz
Copy link
Contributor

vaughnbetz commented Sep 18, 2024

That's good news! Is this with Amin's change to read in initial placement and use a power of 2 channel width in placement?
@amin1377

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang-cpp C/C++ code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants